PEYMA: A Tagged Corpus for Persian Named Entities

Authors

  • Mahsa Sadat Shahshahani
  • Mahdi Mohseni
  • Azadeh Shakery
  • Heshaam Faili
Abstract

The goal of the named entity recognition (NER) task is to classify the proper nouns in a text into classes such as person, location, and organization. This is an important preprocessing step in many natural language processing tasks, such as question answering and summarization. Although many studies have been conducted in this area for English, and state-of-the-art NER systems achieve F1 scores above 90 percent, very few studies address this task for Persian. One of the main causes may be the lack of a standard Persian NER dataset for training and testing NER systems. In this research we create a standard, sufficiently large tagged Persian NER dataset, which will be distributed free of charge for research purposes. To construct such a dataset, we studied the standard NER datasets built for English and found that almost all of them are constructed from news texts, so we collected documents from ten news websites. To give annotators guidelines for tagging these documents, we studied the guidelines used to construct the standard English CoNLL and MUC datasets and then defined our own guidelines, taking Persian linguistic rules into account. Under these guidelines, every word in a document is labeled as person, location, organization, time, date, percent, currency, or other (a word that belongs to none of the other seven classes). We use IOB encoding to annotate named entities, as in most existing standard English NER datasets. In this encoding, the first token of a named entity is labeled B, any subsequent tokens of the same entity are labeled I, and words that are not part of any named entity are labeled O. The final corpus consists of 709 documents containing 302,530 tokens, of which 41,148 are labeled as part of a named entity and the rest are labeled O.
To determine inter-annotator agreement, 160 documents were labeled by different annotators. The kappa statistic, computed over the words labeled as named entities, was estimated at 95%. After creating the dataset, we used it to design a hybrid system for named entity recognition, consisting of a rule-based part and a statistical part. The rule-based part consists of lists of frequent named entities together with regular expressions based on Persian linguistic rules. The statistical part is based on the conditional random fields model, a typical method for sequence labeling problems that is frequently used for NER in other languages. Because deep learning has become widely used in natural language processing in recent years, we also built a system based on LSTM neural networks. Our results indicate that the proposed hybrid system (combining the statistical and list-based models) reaches an F1 measure of 84% over seven labels: person, location, organization, date, time, percent, and currency.
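
The IOB encoding described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' annotation tooling. The function name `iob_encode`, the CoNLL-style `B-PER` / `I-PER` tag format, the short label names, and the example sentence are all assumptions made for illustration; the abstract itself only states that the first token of an entity is labeled B, later tokens I, and all other words O.

```python
from typing import List, Tuple

# (start, end, label): token indices with end exclusive. The short label
# names (PER, LOC, DAT) are illustrative assumptions, not the corpus tag set.
Span = Tuple[int, int, str]


def iob_encode(num_tokens: int, spans: List[Span]) -> List[str]:
    """Return one IOB tag per token for a set of non-overlapping entity spans."""
    tags = ["O"] * num_tokens  # every token starts outside any entity
    for start, end, label in spans:
        if not 0 <= start < end <= num_tokens:
            raise ValueError(f"span out of range: {(start, end, label)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end, label)}")
        tags[start] = f"B-{label}"  # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens of the same entity
    return tags


# Hypothetical example sentence (not taken from the corpus).
tokens = ["Sara", "Ahmadi", "visited", "Tehran", "in", "March", "."]
spans = [(0, 2, "PER"), (3, 4, "LOC"), (5, 6, "DAT")]
tags = iob_encode(len(tokens), spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Writing one token and one tag per line, as the loop does, mirrors the column layout commonly used for CoNLL-style NER data; whether the distributed corpus uses exactly this layout is not stated in the abstract.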


Similar resources

PAYMA: A Tagged Corpus of Persian Named Entities

The goal in the named entity recognition task is to classify proper nouns of a piece of text into classes such as person, location, and organization. Named entity recognition is an important preprocessing step in many natural language processing tasks such as question-answering and summarization. Although many research studies have been conducted in this area in English and the state-of-the-art...


Corpus based coreference resolution for Farsi text

"Coreference resolution" or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreference when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...


The A'lam Corpus: A Standard Named Entity Corpus for the Persian Language

Named entity recognition (NER) is a natural language processing (NLP) problem that is mainly used for text summarization, data mining, data retrieval, question and answering, machine translation, and document classification systems. A NER system is tasked with determining the border of each named entity, recognizing its type and classifying it into predefined categories. The categories of named...


Named Entity Recognition in Persian Text using Deep Learning

Named entity recognition is a fundamental task in the field of natural language processing. It is also known as a subtask of information extraction. The process of recognizing named entities aims at finding proper nouns in the text and classifying them into predetermined classes such as names of people, organizations, and places. In this paper, we propose a named entity recognizer which benefi...


Coreference resolution with deep learning in the Persian Language

Coreference resolution is an advanced issue in natural language processing. Nowadays, due to the extension of social networks, TV channels, news agencies, the Internet, etc. in human life, reading all the contents, analyzing them, and finding a relation between them require time and cost. In the present era, text analysis is performed using various natural language processing techniques, one ...



Journal:
  • CoRR

Volume abs/1801.09936

Publication year 2018